首先安裝套件
install.packages("elastic")
接下來到網頁取得 Elastic Endpoint
然後在當初 Elastic Cloud 設定完成時,Elastic 會給你一組帳密進行連線,此處我們假設為
user = "elastic"
pwd = "abc123"
接下來要在 R 進行設定
這邊我們先假設取得的 Elastic Endpoint 是 "https://abc123.asia-east1.gcp.elastic-cloud.com:9243"
如果依照套件官網的說明,你可能會如下編寫
# Connect Error
x <- connect(
host = "https://abc123.asia-east1.gcp.elastic-cloud.com:9243",
path = "",
user = "elastic",
pwd = "abc123",
port = 9243,
transport_schema = "https"
)
或是
# Connect Error
x <- connect(
host = "https://abc123.asia-east1.gcp.elastic-cloud.com",
path = "",
user = "elastic",
pwd = "abc123",
port = 9243,
transport_schema = "https"
)
但都會得到以下的錯誤訊息
Found http or https on es_host, stripping off, see the docs
正確的寫法應該是
# Connect OK!
x <- connect(
host = "abc123.asia-east1.gcp.elastic-cloud.com",
path = "",
user = "elastic",
pwd = "abc123",
port = 9243,
transport_schema = "https"
)
重點就是在 host 設定時,要保持單純,不需要 port 與 https 的資訊。
設定完成後,我們可以先來進行第一次查詢,先試著回傳在 Index "elastic_json"的第一筆資料
library("elastic")
# Connect to Elastic Cloud
x <- connect(
host = "abc123.asia-east1.gcp.elastic-cloud.com",
path = "",
user = "elastic",
pwd = "abc123",
port = 9243,
transport_schema = "https"
)
# My First Query
query1record <- Search(x, index = "elastic_json", size = 1)$hits$hits
此處我們可以看到資料會以 List 的方式回傳至 R,可以使用 typeof 或 str 進行檢查。
那如果我們要像昨天一樣進行星星數大於2000?的查詢呢?
首先讓我們快速看一下,在 elastic 套件中提供了那些的操作:
ls('package:elastic')
查詢結果如下,可以發現CURD的功能都有提供:
[1] "alias_create" "alias_delete" "alias_exists"
[4] "alias_get" "alias_rename" "aliases_get"
[7] "cat_" "cat_aliases" "cat_allocation"
[10] "cat_count" "cat_fielddata" "cat_health"
[13] "cat_indices" "cat_master" "cat_nodeattrs"
[16] "cat_nodes" "cat_pending_tasks" "cat_plugins"
[19] "cat_recovery" "cat_segments" "cat_shards"
[22] "cat_thread_pool" "cluster_health" "cluster_pending_tasks"
[25] "cluster_reroute" "cluster_settings" "cluster_state"
[28] "cluster_stats" "connect" "count"
[31] "docs_bulk" "docs_bulk_create" "docs_bulk_delete"
[34] "docs_bulk_index" "docs_bulk_prep" "docs_bulk_update"
[37] "docs_create" "docs_delete" "docs_delete_by_query"
[40] "docs_get" "docs_mget" "docs_update"
[43] "docs_update_by_query" "es_parse" "explain"
[46] "field_caps" "field_mapping_get" "field_stats"
[49] "index_analyze" "index_clear_cache" "index_close"
[52] "index_create" "index_delete" "index_exists"
[55] "index_flush" "index_forcemerge" "index_get"
[58] "index_open" "index_optimize" "index_recovery"
[61] "index_recreate" "index_segments" "index_settings"
[64] "index_settings_update" "index_shrink" "index_stats"
[67] "index_status" "index_template_delete" "index_template_exists"
[70] "index_template_get" "index_template_put" "index_upgrade"
[73] "info" "mapping_create" "mapping_delete"
[76] "mapping_get" "mlt" "msearch"
[79] "mtermvectors" "nodes_hot_threads" "nodes_info"
[82] "nodes_shutdown" "nodes_stats" "percolate_count"
[85] "percolate_delete" "percolate_list" "percolate_match"
[88] "percolate_register" "ping" "pipeline_attachment"
[91] "pipeline_create" "pipeline_delete" "pipeline_get"
[94] "pipeline_simulate" "reindex" "scroll"
[97] "scroll_clear" "Search" "search_shards"
[100] "Search_template" "Search_template_delete" "Search_template_get"
[103] "Search_template_register" "Search_template_render" "Search_uri"
[106] "tasks" "tasks_cancel" "termvectors"
[109] "tokenizer_set" "type_exists" "type_remover"
[112] "validate"
今天先讓我們以 Search 進行我們我們的操作,操作的方式基本上與昨天相同,在 q 中進行設定。
# 星星數大於2000
star_Over_2000 <- Search(x, index = "elastic_json", q = "stargazers_count:>2000")$hits$hits
# 星星數大於199
star_Over_199 <- Search(x, index = "elastic_json", q = "stargazers_count:>199")$hits$hits
# 星星數大於20
star_Over_20 <- Search(x, index = "elastic_json", q = "stargazers_count:>20")$hits$hits
讓我們檢查一下回傳的資料筆數是否正確。我們在昨天已經知道,星星數大於2000的有2筆。
> length(star_Over_2000)
[1] 2
> length(star_Over_199)
[1] 10
> length(star_Over_20)
[1] 10
咦! 為什麼星星數大於 199 與 20的都只有10筆呢,這不太合理吧?
後來到 Elastic Cloud API Console 進行確認,的確是預設只傳前10筆完整的明細資料,而不會回傳全部。
所以如果要更正確的直接取得聚合後的資料,只需要將程式碼改為
Search(x, index = "elastic_json", q = "stargazers_count:>200")$hits$total$value
Search(x, index = "elastic_json", q = "stargazers_count:>20")$hits$total$value
Search(x, index = "elastic_json", q = "stargazers_count:>20")$hits$total$value
# 星星數大於2000
star_Over_2000 <- Search(x, index = "elastic_json", q = "stargazers_count:>2000")$hits$total$value
# 星星數大於199
star_Over_199 <- Search(x, index = "elastic_json", q = "stargazers_count:>199")$hits$total$value
# 星星數大於20
star_Over_20 <- Search(x, index = "elastic_json", q = "stargazers_count:>20")$hits$total$value
再檢查一次結果,這下子就合理多了。Repos的星星數大於2000的有2筆,星星數大於199的有13筆,星星數大於20的有33筆:
> star_Over_2000
[1] 2
> star_Over_199
[1] 13
> star_Over_20
[1] 33
今天我們透過R語言,直接對 Elastic Cloud 上的資料進行蒐尋,後續我們再找時間實作如何完成其他 CURD 的操作。